Towards High-performance Haplotype Assembly for Future Sequencing
Authors
Abstract
Haplotype assembly is an essential step in human genome analysis. Since the well-known MEC model for its solution is NP-hard, the problem is currently addressed with algorithms whose cost grows exponentially with the length of the DNA fragments produced by the sequencing process. Technological improvements will reduce fragmentation and increase fragment length, making such computational costs worse. WHATSHAP is a recently proposed approach that moves the complexity from fragment length to fragment overlap, which improves the outlook for computational costs, but haplotype assembly remains a computationally demanding problem. This paper discusses directions towards high-performance haplotype assembly for future sequencing, based on a parallel version of WHATSHAP.

1 Scientific Background

The human genome is diploid, i.e. each chromosome comes in two copies, each of which is a haploid chromosome coming from one of the two parents (one allele per parent). Single Nucleotide Polymorphisms (SNPs) are single DNA positions in a chromosome where the nucleotide can differ between individuals, e.g. between the parents, and can therefore differ between the two DNA copies of a single individual. Haplotyping is the task of phasing the SNPs, i.e. assigning their values to one of the two DNA copies (alleles) inherited from the parents. Genomic data obtained from a sequencing experiment is a mixture of the two copies of the chromosomes, in the form of many DNA fragments, called reads, each coming from one of the two alleles; the reads have not yet been assembled into contiguous sequences of whole chromosomes. When SNP phasing is performed directly on such raw sequencing reads, we speak of haplotype assembly: each read is assigned to one of the two alleles. Therefore, reads that exhibit different values at the same SNP position must necessarily belong to different alleles.
Arbitrarily relabelling the alleles as 0 and 1, the input data can be represented as an n × m matrix F, with n the number of reads and m the number of SNP sites. The i-th read (the i-th row of F) is represented as a string over the alphabet {0, 1, −}, where a value 0 (resp. 1) at column j means that the i-th read carries the value of allele 0 (resp. 1) at the j-th SNP position, and the value − means that the read does not cover the j-th SNP position. A conflict is an SNP position where two reads rp and rq have different values (that is, a 0 and a 1). Reads that have distinct allele values at a common SNP are assumed to come from different chromosome copies. A correct haplotype assembly corresponds to a bipartition of the rows of F into two sets F0 and F1 such that each of the two sets is conflict free. Unfortunately, due to sequencing errors, such a bipartition does not exist in real data. The problem thus becomes that of detecting a minimal number of errors to be corrected (or removed) in order to obtain a conflict-free bipartition.

Proceedings of CIBB 2014

In the literature, there are several models for the haplotype assembly problem, corresponding to different optimization problems: Minimal Error Correction (MEC) corrects the minimum number of errors, by turning 0s into 1s or vice versa; Minimal Error Removal (MER) removes the minimum number of errors, by turning 0s or 1s into −s; Minimal Fragment Removal (MFR) removes the minimum number of conflicting fragments. We focus on detecting sequencing errors (and not mapping errors, i.e. erroneous assignments of a read to a position in the genome), thus concentrating on MEC and MER. The two have been proved to be equivalent, and actually both can be reduced to finding a MAX-CUT in a graph and therefore are NP-hard.
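The definitions above can be illustrated with a minimal Python sketch. It assumes reads are stored as equal-length strings, with ASCII "-" standing in for −. The function names (in_conflict, mec_cost, brute_force_mec) and the five-read matrix are hypothetical and not taken from the paper. For a fixed bipartition, the fewest flips that make a part conflict free at one SNP column is the count of its minority value, so the MEC cost is the sum of these minorities. The exhaustive search over all bipartitions is only meant to make the objective concrete on tiny inputs; it is not how WHATSHAP or any practical tool proceeds.

```python
from itertools import product

GAP = "-"  # ASCII stand-in for the "−" symbol (read does not cover the SNP)


def in_conflict(read_p, read_q):
    """True if the two reads show a 0 against a 1 at some SNP both cover."""
    return any(a != b and GAP not in (a, b) for a, b in zip(read_p, read_q))


def mec_cost(reads, assignment):
    """Fewest 0/1 flips making both parts conflict free for a fixed bipartition.

    assignment[i] is 0 or 1: the part (F0 or F1) that read i is placed in.
    """
    cost = 0
    for part in (0, 1):
        rows = [r for r, a in zip(reads, assignment) if a == part]
        for j in range(len(reads[0])):
            zeros = sum(r[j] == "0" for r in rows)
            ones = sum(r[j] == "1" for r in rows)
            cost += min(zeros, ones)  # flip the minority value in this column
    return cost


def brute_force_mec(reads):
    """Try all 2^n bipartitions; illustrative only, feasible for tiny n."""
    best = min(product((0, 1), repeat=len(reads)),
               key=lambda a: mec_cost(reads, a))
    return mec_cost(reads, best), best


# Hypothetical instance: haplotypes 011 and 100, last read has one error.
F = ["01-", "-11", "10-", "-00", "010"]
print(brute_force_mec(F)[0])  # 1
```

In this toy instance the last read conflicts with the second, third and fourth reads, and the second and third also conflict with each other, so no conflict-free bipartition exists and at least one correction is needed.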
Several approaches and tools for solving MEC have been put forward in the last ten years: greedy heuristics for assembling the haplotypes of a genome [11, 8]; a method to sample a set of likely haplotypes under the MEC model [5], with a faster follow-up based on the definition of a graph [6]; and an iterative greedy heuristic to optimize the MAX-CUT of that graph [4]. The latter outperforms [11, 8, 5] and shows similar accuracy to [5]. Reductions of MEC to MAX-SAT are given in [10, 7]. Since the problem is NP-hard, all practical solutions to MEC are either statistical or heuristic approaches, or exact fixed-parameter tractable algorithms whose complexity is exponential in the number of SNPs per read or in the read length. Because sequencing technologies keep evolving towards ever-increasing read lengths, fixed-parameter tractable methods that are exponential in the read length (or in the number of SNPs per read, which also grows with read length) will perform worse and worse on future-generation longer reads.

In [12], some of the authors introduced WHATSHAP, the first exact fixed-parameter tractable algorithm for solving MEC that, importantly, is exponential in the sequencing coverage rather than in the read length. The coverage is the maximum number of different reads that cover a single SNP position. WHATSHAP turns out to be quite accurate because it actually solves wMEC, a weighted generalisation of MEC in which a confidence degree is associated with each 0 and 1, and less confident values are the most likely to be corrected. WHATSHAP is still a computationally demanding algorithm. For instance, experiments with coverage limited to at most 20× take on the order of an hour on a single core of a standard desktop machine. It is worth remarking that higher coverages may occur in practice, and even small increases may have a substantial impact.
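As a rough illustration of the two notions just introduced, the following Python sketch computes the coverage of a small read matrix and the weighted cost of a given bipartition. It assumes that the wMEC cost of a bipartition is the total weight of the entries that must be flipped, which is one reading of the description above. The names coverage and wmec_cost, the example reads and the weights are hypothetical. It does not reproduce the WHATSHAP algorithm itself, only the quantities that algorithm is parameterised by and optimises.

```python
GAP = "-"  # ASCII stand-in for the "−" symbol (read does not cover the SNP)


def coverage(reads):
    """Maximum number of reads covering any single SNP column."""
    return max(sum(r[j] != GAP for r in reads) for j in range(len(reads[0])))


def wmec_cost(reads, weights, assignment):
    """Total weight of flips needed for a fixed bipartition (assumed wMEC cost).

    weights[i][j] is the confidence of entry (i, j); gap entries are ignored.
    In each part and column, the value with the smaller total weight is flipped.
    """
    total = 0
    for part in (0, 1):
        rows = [i for i, a in enumerate(assignment) if a == part]
        for j in range(len(reads[0])):
            w0 = sum(weights[i][j] for i in rows if reads[i][j] == "0")
            w1 = sum(weights[i][j] for i in rows if reads[i][j] == "1")
            total += min(w0, w1)
    return total


# Hypothetical reads and confidence values (0 used as a placeholder at gaps).
F = ["01-", "-11", "10-", "-00", "010"]
W = [[20, 20, 0], [0, 30, 30], [25, 25, 0], [0, 15, 15], [10, 10, 5]]

print(coverage(F))                       # 5: every read covers the middle SNP
print(wmec_cost(F, W, (0, 0, 1, 1, 0)))  # 5: the low-confidence 0 is corrected
```

With unit weights the weighted cost reduces to the plain MEC count of flipped entries, which is why wMEC is described as a generalisation of MEC.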
Looking ahead, datasets with higher coverage are desirable, since they could further improve the accuracy of WHATSHAP. Considering also that the analysis of a whole genome may require solving several (a few tens of) independent instances of haplotype assembly, it is clearly worth exploring the possibility of a parallel version of WHATSHAP.

2 Materials and Methods
Related resources
Solving a multi-objective mixed-model assembly line balancing and sequencing problem
This research addresses the mixed-model assembly line (MMAL) by considering various constraints. In MMALs, several types of products which their similarity is so high are made on an assembly line. As a consequence, it is possible to assemble and make several types of products simultaneously without spending any additional time. The proposed multi-objective model considers the balancing and sequ...
HapCompass: A Fast Cycle Basis Algorithm for Accurate Haplotype Assembly of Sequence Data
Genome assembly methods produce haplotype phase ambiguous assemblies due to limitations in current sequencing technologies. Determining the haplotype phase of an individual is computationally challenging and experimentally expensive. However, haplotype phase information is crucial in many bioinformatics workflows such as genetic association studies and genomic imputation. Current computational ...
Theory and Algorithms for the Haplotype Assembly Problem
Genome sequencing studies to date have generally sought to assemble consensus genomes by merging sequence contributions from multiple homologous copies of each chromosome. With growing interest in genetic variations, however, there is a need for methods to separate these distinct contributions and assess how individual homologous chromosome copies differ from one another. An approach to this pr...
trio-sga: facilitating de novo assembly of highly heterozygous genomes with parent-child trios
Motivation: Most DNA sequence in diploid organisms is found in two copies, one contributed by the mother and the other by the father. The high density of differences between the maternally and paternally contributed sequences (heterozygous sites) in some organisms makes de novo genome assembly very challenging, even for algorithms specifically designed to deal with these cases. Therefore, vario...
WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads
The human genome is diploid, which requires assigning heterozygous single nucleotide polymorphisms (SNPs) to the two copies of the genome. The resulting haplotypes, lists of SNPs belonging to each copy, are crucial for downstream analyses in population genetics. Currently, statistical approaches, which are oblivious to direct read information, constitute the state-of-the-art. Haplotype assembly...
Publication date: 2014